Abstract

The ability to accurately model a sentence at varying stages (e.g.,
word-phrase-sentence) plays a central role in natural language processing.
As an effort towards this goal we propose a self-adaptive hierarchical
sentence model (AdaSent). AdaSent effectively forms a hierarchy of
representations from words to phrases and then to sentences through
recursive gated local composition of adjacent segments. We design a
competitive mechanism (through gating networks) to allow the representations
of the same sentence to be engaged in a particular learning task (e.g.,
classification), thereby effectively mitigating the gradient vanishing
problem persistent in other recursive models. Both qualitative and
quantitative analyses show that AdaSent can automatically form and select
the representations suitable for the task at hand during training, yielding
superior classification performance over competitor models on 5 benchmark
data sets.


1 Introduction 


The goal of sentence modeling is to represent the meaning of a sentence so
that it can be used as input for other tasks. Previously, this task was
often cast as semantic parsing, which aims to find a logical form that can
describe the sentence. With recent advances in distributed representations
and deep neural networks, it is now common practice to find a vectorial
representation of sentences, which turns out to be quite effective for tasks
of classification [Kim, 2014], machine translation [Cho et al., 2014;
Bahdanau et al., 2015], and semantic matching [Hu et al., 2014].

Perhaps the simplest method in this direction is the continuous Bag-of-Words
(cBoW), where the representations of sentences are obtained by global
pooling (e.g., average pooling or max pooling) over their word vectors. The
word vectors, also known as word embeddings, can be determined in either a
supervised or an unsupervised fashion. cBoW, although effective at capturing
the topics of sentences, does not consider the sequential nature of words,
and therefore has difficulty capturing the structure of sentences. There has
been a surge of sentence models that incorporate the order of words, mostly
based on neural networks of various forms, including recursive neural
networks [Socher et al., 2010; Socher et al., 2012; Socher et al., 2013],
recurrent neural networks [Irsoy and Cardie, 2015], and convolutional neural
networks [Kalchbrenner et al., 2014; Kim, 2014]. These works apply levels of
non-linear transformations to model interactions between words, and the
structure of these interactions can also be learned on the fly through gated
networks [Cho et al., 2014]. However, these models output a fixed-length
continuous vector that does not retain intermediate information obtained
during the composition process, which may be valuable depending on the task
at hand.


In this paper, we propose a self-adaptive hierarchical sentence model
(AdaSent). Instead of maintaining a fixed-length continuous vectorial
representation, our model forms a multi-scale hierarchical representation.
AdaSent is inspired by the gated recursive convolutional neural network
(grConv) [Cho et al., 2014] in the sense that the information flow forms a
pyramid with a directed acyclic graph structure where local words are
gradually composed to form intermediate representations of phrases. Unlike
cBoW, recurrent and recursive neural networks with fixed structures, the
gated nature of AdaSent allows the information flow to vary with each task
(i.e., no need for a pre-defined parse tree). Unlike grConv, which outputs a
fixed-length representation of the sentence at the top of the pyramid,
AdaSent uses the intermediate representations at each level of the pyramid
to form a multi-scale summarization. A convex combination of the
representations at each level is used to adaptively give more weight to some
levels depending on the sentence and the task. Fig. 1 illustrates the
architecture of AdaSent and compares it to cBoW, recurrent neural networks
and recursive neural networks.


Our contributions can be summarized as follows. First, we propose a novel
architecture for short sequence modeling which explores a new direction: using
a hierarchical multi-scale representation rather than a flat, fixed-length
representation. Second, we qualitatively show that our model is able to
automatically learn the representation which is suitable for the task at
hand through proper training. Third, we conduct extensive empirical studies
on 5 benchmark data sets to quantitatively show the superiority of our model
over previous approaches.























The sentence representation $h$ is then the hidden vector obtained at the
last time step, $h_T$, which summarizes all the past words. The composition
dynamics in recurrent neural networks can be described by a chain as in
Fig. 2a.



Figure 1: The overall diagram of AdaSent for the example input "the cat sat
on the mat" (better viewed in color). Flows with green and blue colors act
as special cases for recurrent neural networks and recursive neural networks
respectively (see more details in Sec. 3.2). Each level of the pyramid is
pooled and the whole pyramid reduces into a hierarchy which is then fed to a
gating network and a classifier to form an ensemble.

2 Background 

Let $x_{1:T}$ denote the input sequence with length $T$. Each token
$x_t \in x_{1:T}$ is a $V$-dimensional one-hot binary vector encoding the
$t$th word of the sequence, where $V$ is the size of the vocabulary. We use
$U \in \mathbb{R}^{d \times V}$ to denote the word embedding matrix, in which
the $j$th column is the $d$-dimensional distributed representation of the
$j$th word in the vocabulary. Hence the word vectors for the sequence
$x_{1:T}$ are obtained by $h^0_{1:T} = U x_{1:T}$.

In the cBoW sentence model, the representation $h$ for $x_{1:T}$ is obtained
by global pooling, either average pooling (Eq. 1) or max pooling (Eq. 2),
over all the word vectors:

$$h = \frac{1}{T} \sum_{t=1}^{T} h^0_t = \frac{1}{T} U \sum_{t=1}^{T} x_t \qquad (1)$$

$$h_j = \max_{t \in 1:T} h^0_{t,j}, \quad j = 1, \ldots, d \qquad (2)$$

It is clear that cBoW is insensitive to the ordering of words and also to
the length of a sentence, hence it is likely for two different sentences
with different semantic meanings to be embedded into the same vector
representation.
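
The order insensitivity described above can be checked on a toy example. The sketch below is only an illustration: the two-dimensional word vectors are made-up values rather than trained embeddings, and the helper names are hypothetical.

```python
# Sketch of cBoW pooling (Eq. 1 and Eq. 2) on toy 2-dimensional word vectors.
# The vectors are made-up illustrative values, not trained embeddings.

def average_pool(vectors):
    """Component-wise mean over a list of equal-length vectors (Eq. 1)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def max_pool(vectors):
    """Component-wise maximum over a list of equal-length vectors (Eq. 2)."""
    return [max(column) for column in zip(*vectors)]

toy_embedding = {          # hypothetical word vectors
    "dog":   [1.0, 0.0],
    "bites": [0.0, 2.0],
    "man":   [3.0, 1.0],
}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]   # same words, different meaning

vectors_a = [toy_embedding[w] for w in sentence_a]
vectors_b = [toy_embedding[w] for w in sentence_b]

# Both pooling schemes ignore word order, so the two sentences collide.
same_under_max = max_pool(vectors_a) == max_pool(vectors_b)
same_under_avg = average_pool(vectors_a) == average_pool(vectors_b)
```

Both flags come out true here, which is the collision the paragraph warns about: pooling sees a set of word vectors, not a sequence.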

Recurrent neural networks [Elman] are a class of neural networks where
recurrent connections between input units and hidden units are formed
through time. The sequential nature of recurrent neural networks makes them
applicable to various sequential generation tasks, e.g., language modeling
[Mikolov et al., 2010] and machine translation [Bahdanau et al., 2015;
Cho et al., 2014].

Given a sequence of word vectors $h^0_{1:T}$, the hidden layer vector $h_t$
at time step $t$ is computed from a non-linear transformation of the current
input vector $h^0_t$ and the hidden vector at the previous time step
$h_{t-1}$. Let $W$ be the input-hidden connection matrix, $H$ be the
recurrent hidden-hidden connection matrix and $b$ be the bias vector. Let
$f(\cdot)$ be the component-wise non-linear transformation function. The
dynamics of recurrent neural networks can be described by the following
equations:

$$h_0 = 0, \qquad h_t = f(W h^0_t + H h_{t-1} + b) \qquad (3)$$
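
As a minimal sketch of Eq. 3, assuming $f = \tanh$ and small hand-picked matrices (all numeric values and function names below are illustrative assumptions, not taken from the paper), the recurrence can be unrolled as follows. Unlike pooling, the result depends on word order.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_encode(word_vectors, W, H, b):
    """Return the last hidden state of h_t = tanh(W h0_t + H h_{t-1} + b), h_0 = 0."""
    hidden = [0.0] * len(b)
    for x in word_vectors:
        pre = [wi + hi + bi
               for wi, hi, bi in zip(matvec(W, x), matvec(H, hidden), b)]
        hidden = [math.tanh(p) for p in pre]
    return hidden

# Hypothetical toy parameters (2-dimensional inputs and hidden states).
W = [[0.5, 0.0], [0.0, 0.5]]
H = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

word_a, word_b = [1.0, 0.0], [0.0, 1.0]
forward = rnn_encode([word_a, word_b], W, H, b)
backward = rnn_encode([word_b, word_a], W, H, b)   # same words, reversed order
```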


(a) Composition process in a recurrent neural network.

(b) Composition process in a recursive neural network.

Figure 2: Composition dynamics in recurrent and recursive neural networks.
The one-hot binary encoding of word sequences is first multiplied by the
word embedding matrix U to obtain the word vectors before entering the
network.


Recursive neural networks build on the idea of composing along a pre-defined
binary parse tree. The leaves of the parse tree correspond to words, which
are initialized by their word vectors. Non-linear transformations are
recursively applied bottom-up to generate the hidden representation of a
parent node given the hidden representations of its two children. The
composition dynamics in a recursive neural network can be described as
$h = f(W_L h_l + W_R h_r + b)$, where $h$ is the hidden representation for a
parent node in the parse tree and $h_l$, $h_r$ are the hidden representations
for the left and right child of the parent node, respectively. $W_L$, $W_R$
are the left and right recursive connection matrices. As in recurrent neural
networks, all the parameters in recursive neural networks are shared
globally. The representation for the whole sentence is then the hidden
vector obtained at the root of the binary parse tree. An example is shown in
Fig. 2b.
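
The bottom-up composition along a fixed parse tree can be sketched as below. This is an illustration under assumptions: $f = \tanh$, parse trees written as nested Python tuples, and hypothetical toy embeddings and weights. It shows that the sentence vector depends on the tree that is supplied.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def compose(left, right, W_L, W_R, b):
    """Parent representation h = tanh(W_L h_l + W_R h_r + b)."""
    pre = [l + r + bias
           for l, r, bias in zip(matvec(W_L, left), matvec(W_R, right), b)]
    return [math.tanh(p) for p in pre]

def encode_tree(tree, embedding, W_L, W_R, b):
    """Leaves are words (str); internal nodes are (left, right) tuples."""
    if isinstance(tree, str):
        return embedding[tree]
    left, right = tree
    return compose(encode_tree(left, embedding, W_L, W_R, b),
                   encode_tree(right, embedding, W_L, W_R, b),
                   W_L, W_R, b)

# Hypothetical toy values.
toy_embedding = {"the": [0.1, 0.0], "cat": [1.0, 0.5], "sat": [0.0, 1.0]}
W_L = [[0.5, 0.0], [0.0, 0.5]]
W_R = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

root_left = encode_tree((("the", "cat"), "sat"), toy_embedding, W_L, W_R, b)
root_right = encode_tree(("the", ("cat", "sat")), toy_embedding, W_L, W_R, b)
```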


Although the composition process is non-linear in a recursive neural
network, it is pre-defined by a given binary parse tree. The gated recursive
convolutional neural network (grConv) [Cho et al., 2014] extends the
recursive neural network through a gating mechanism that allows it to learn
the structure of recursive composition on the fly. If we consider the
composition structure in a recurrent neural network as a linear chain and
the composition structure in a recursive neural network as a binary tree,
then the composition structure in a grConv can be described as a pyramid,
where word representations are locally combined until we reach the top of
the pyramid, which gives us the global representation of a whole sentence.
We refer interested readers to [Cho et al., 2014] for more details about
grConv.










































3 Self-Adaptive Hierarchical Sentence Model 

AdaSent is inspired by and built on grConv. It differs from grConv and other
neural sentence models that try to obtain a fixed-length vector
representation: AdaSent forms a hierarchy of abstractions of the input
sentence and feeds the hierarchy as a multi-scale summarization into the
following classifier, combined with a gating network that decides the weight
of each level in the final consensus, as illustrated in Fig. 1.

3.1 Structure 

The structure of AdaSent is a directed acyclic graph as shown in Fig. 3. For
an input sequence of length $T$, AdaSent is a pyramid of $T$ levels. Let the
bottom level be the first level and the top level be the $T$th level. Define
the scope of each unit in the first level to be the corresponding word,
i.e., $\mathrm{scope}(h^1_j) = \{x_j\}, \forall j \in 1:T$, and for any
$t \ge 2$, define
$\mathrm{scope}(h^t_j) = \mathrm{scope}(h^{t-1}_j) \cup \mathrm{scope}(h^{t-1}_{j+1})$.
Then the $t$th level in AdaSent contains a layer of $T - t + 1$ units where
each unit has a scope of size $t$. More specifically, the scope of $h^t_j$
is $x_{j:j+t-1}$. Intuitively, for the sub-pyramid rooted at $h^t_j$, we can
interpret $h^t_j$ as a top level summarization of the phrase $x_{j:j+t-1}$
in the original sentence. For example, in Fig. 3, $h^3_4$ can be viewed as
a summarization of the phrase "on the mat".
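
The scope bookkeeping above is easy to verify in code. The helper below is a hypothetical illustration using the paper's 1-indexed levels and positions and the example sentence from Fig. 1.

```python
def unit_scope(words, level, position):
    """Words covered by the unit at 1-indexed (level t, position j): x_{j:j+t-1}."""
    if not 1 <= level <= len(words):
        raise ValueError("level must lie in 1..T")
    if not 1 <= position <= len(words) - level + 1:
        raise ValueError("position must lie in 1..T-t+1")
    return words[position - 1: position - 1 + level]

words = "the cat sat on the mat".split()

# Level t holds T - t + 1 units, so the pyramid narrows by one unit per level.
level_sizes = [len(words) - t + 1 for t in range(1, len(words) + 1)]

phrase = unit_scope(words, level=3, position=4)   # the unit h^3_4
```

Here `phrase` is the three-word span "on the mat", and `level_sizes` runs from 6 units at the bottom to a single unit at the top.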



Figure 3: Composition dynamics in AdaSent. The $j$th unit on the $t$th level
is an intermediate hidden representation of the phrase $x_{j:j+t-1}$ in the
original sentence. All the units on the $t$th level are then pooled to
obtain the $t$th level representation in the hierarchy H.

In general, units at the $t$th level are intermediate hidden representations
of all the consecutive phrases of length $t$ in the original sentence (see
the scopes of units at the 3rd level in Fig. 3 for an example). There are
two extreme cases in AdaSent: the first level contains word vectors and the
top level is a global summarization of the whole sentence.

Before the pre-trained word vectors enter the first level of the pyramid, we
apply a linear transformation to map word vectors from $\mathbb{R}^d$ to
$\mathbb{R}^D$ with $D > d$. This allows phrases and sentences to live in a
space of higher dimension than words, reflecting their richer structures.
More specifically, the hidden representation at the first level of the
pyramid is

$$h^1_{1:T} = U' h^0_{1:T} = U' U x_{1:T} \qquad (4)$$

where $U' \in \mathbb{R}^{D \times d}$ is the linear transformation matrix in
AdaSent and $U \in \mathbb{R}^{d \times V}$ is the word-embedding matrix
trained with a large unlabeled corpus. Equivalently, one can view
$\tilde{U} = U'U \in \mathbb{R}^{D \times V}$ as a new word-embedding matrix
tailored for AdaSent. This factorization of the word-embedding matrix also
helps to reduce the effective number of parameters in our model when
$d \ll D$.
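
To make the parameter saving concrete, the sketch below counts entries in the factored form ($U'$ plus $U$) against a direct $D \times V$ embedding. The vocabulary size (300,000) and word dimension (50) are the values reported in the implementation details of this paper; the hidden dimension $D = 100$ is a hypothetical choice made only for this illustration.

```python
def embedding_parameter_counts(vocab_size, word_dim, hidden_dim):
    """Return (factored, direct) parameter counts for the word embedding.

    factored: U' (hidden_dim x word_dim) plus U (word_dim x vocab_size)
    direct:   a single hidden_dim x vocab_size embedding matrix
    """
    factored = hidden_dim * word_dim + word_dim * vocab_size
    direct = hidden_dim * vocab_size
    return factored, direct

# V = 300,000 and d = 50 follow the paper; D = 100 is an assumed value.
factored, direct = embedding_parameter_counts(300_000, 50, 100)
```

With these numbers the factored form needs 15,005,000 parameters against 30,000,000 for the direct matrix, and the gap widens as $D$ grows relative to $d$.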


3.2 Local Composition and Level Pooling 

The recursive local composition in the pyramid works in the following way:

$$h^t_j = \omega_l h^{t-1}_j + \omega_r h^{t-1}_{j+1} + \omega_c \tilde{h}^t_j \qquad (5)$$

$$\tilde{h}^t_j = f(W_L h^{t-1}_j + W_R h^{t-1}_{j+1} + b_W) \qquad (6)$$

where $j$ ranges from 1 to $T - t + 1$ and $t$ ranges from 2 to $T$.
$W_L, W_R \in \mathbb{R}^{D \times D}$ are the hidden-hidden combination
matrices, dubbed recurrent matrices, and $b_W \in \mathbb{R}^D$ is a bias
vector. $\omega_l$, $\omega_r$ and $\omega_c$ are the gating coefficients,
which satisfy $\omega_l, \omega_r, \omega_c \ge 0$ and
$\omega_l + \omega_r + \omega_c = 1$. Eq. 6 provides a way to compose the
hidden representation of a phrase of length $t$ from the hidden
representations of its left $t - 1$ prefix and its right $t - 1$ suffix. The
composition in Eq. 6 includes a non-linear transformation, which allows a
flexible hidden representation to be formed. The fundamental assumption
behind the structure of AdaSent is then encoded in Eq. 5: the semantic
meaning of a phrase of length $t$ is a convex combination of the semantic
meanings of its $t - 1$ prefix, its $t - 1$ suffix and the composition of
these two. For example, we expect the meaning of the phrase "the cat" to be
expressed by the word "cat", since "the" is only a definite article, which
does not have a direct meaning. On the other hand, we also hope the meaning
of the phrase "not happy" to consider both the functionality of "not" and
the meaning of "happy". We design the local composition in AdaSent to be
flexible enough to capture such variations in language while letting the
gating mechanism (the way to obtain $\omega_l$, $\omega_r$ and $\omega_c$)
adaptively decide the most appropriate composition from the current context.

Technically, when computing $h^t_j$, the coefficients $\omega_l$, $\omega_r$
and $\omega_c$ are parametrized functions of $h^{t-1}_j$ and $h^{t-1}_{j+1}$
such that they can decide whether to compose these two children by a
non-linear transformation or simply to forward the children's
representations for future composition. For the purpose of illustration, we
use the softmax function to implement the gating mechanism during the local
composition in Eq. 7. But note that we are not limited to a specific choice
of gating mechanism. One can adopt more complex systems, e.g., an MLP, to
implement the local gating mechanism as long as the output of the system is
a multinomial distribution over 3 categories.


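$$(\omega_l, \omega_r, \omega_c)^\top = \mathrm{softmax}(G_L h^{t-1}_j + G_R h^{t-1}_{j+1} + b_G) \qquad (7)$$

$G_L, G_R \in \mathbb{R}^{3 \times D}$ and $b_G \in \mathbb{R}^3$ are shared
globally inside the pyramid. The softmax function over a vector is defined
as:

$$\mathrm{softmax}(v) = \frac{1}{\sum_{i=1}^{l} \exp(v_i)} \big(\exp(v_1), \ldots, \exp(v_l)\big)^\top, \quad v \in \mathbb{R}^l \qquad (8)$$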





















Local compositions are recursively applied until we reach the 
top of the pyramid. 
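
The recursion in Eqs. 5 to 7 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes $f = \tanh$ (the activation named in the implementation details), uses hypothetical toy parameters with $D = 2$, and computes the softmax with the usual max-shift, which is algebraically identical to Eq. 8.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(values):
    """Softmax of Eq. 8; subtracting the max leaves the result unchanged."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def compose_unit(left, right, params):
    """One gated local composition: returns h^t_j from its two children."""
    W_L, W_R, b_W, G_L, G_R, b_G = params
    gate_in = [l + r + b for l, r, b
               in zip(matvec(G_L, left), matvec(G_R, right), b_G)]
    w_l, w_r, w_c = softmax(gate_in)                                  # Eq. 7
    candidate = [math.tanh(l + r + b) for l, r, b
                 in zip(matvec(W_L, left), matvec(W_R, right), b_W)]  # Eq. 6
    return [w_l * a + w_r * c + w_c * h
            for a, c, h in zip(left, right, candidate)]               # Eq. 5

def build_pyramid(first_level, params):
    """Apply the local composition level by level until one unit remains."""
    levels = [first_level]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([compose_unit(prev[j], prev[j + 1], params)
                       for j in range(len(prev) - 1)])
    return levels

# Hypothetical toy parameters with D = 2.
W_L = [[0.5, 0.1], [0.0, 0.5]]
W_R = [[0.3, 0.0], [0.2, 0.4]]
b_W = [0.0, 0.0]
G_L = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2]]
G_R = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
b_G = [0.0, 0.0, 0.0]
params = (W_L, W_R, b_W, G_L, G_R, b_G)

first_level = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6], [0.7, 0.0]]
pyramid = build_pyramid(first_level, params)
```

For a sentence of $T$ words this performs $T(T-1)/2$ local compositions. If the gate bias strongly favors $\omega_l$, a unit simply copies its left child, which is how the pyramid can forward a representation unchanged.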

It is worth noting that the recursive local composition in AdaSent
implicitly forms a weighted model average such that each unit at level $t$
corresponds to a convex combination of all possible sub-structures along
which the composition process is applied over the phrase of length $t$. This
implicit weighted model averaging makes AdaSent more robust to local noise
and deterioration than recurrent nets and recursive nets, where the
composition structure is unique and rigid. Fig. 4 shows an example when
$t = 3$.



Figure 4: The hidden vector obtained at the top can be decomposed into a
convex combination of all possible hidden vectors composed along the
corresponding sub-structures.


Once the pyramid has been built, we apply a pooling operation, either
average pooling or max pooling, to the $t$th level, $t \in 1:T$, of the
pyramid to obtain a summarization of all consecutive phrases of length $t$
in the original sentence, denoted by $\bar{h}^t$ (see the example in Fig. 3
of the global level pooling applied to the 3rd level of the pyramid). It is
straightforward to verify that $\bar{h}^1$ corresponds to the representation
returned by applying cBoW to the whole sentence.
$[\bar{h}^1, \ldots, \bar{h}^T]$ then forms the hierarchy, in which lower
level summarizations pay more attention to local words or short phrases
while higher level summarizations focus more on the global interaction of
different parts of the sentence.
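
Level pooling reduces each level of the pyramid to a single vector. The sketch below assumes the pyramid is given as a list of levels, each a list of equal-length unit vectors; the numbers are made-up toy values and the function names are hypothetical.

```python
def pool_level(units, mode="average"):
    """Pool all unit vectors of one level into a single vector."""
    columns = list(zip(*units))
    if mode == "average":
        return [sum(column) / len(units) for column in columns]
    if mode == "max":
        return [max(column) for column in columns]
    raise ValueError("mode must be 'average' or 'max'")

def pool_pyramid(levels, mode="average"):
    """Return one pooled vector per level, bottom level first."""
    return [pool_level(units, mode) for units in levels]

toy_levels = [
    [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]],   # level 1: three words
    [[0.5, 1.0], [1.5, 3.0]],               # level 2: two phrases of length 2
    [[0.25, -1.0]],                         # level 3: the whole sentence
]

hierarchy = pool_pyramid(toy_levels, "average")
```

The result always has $T$ vectors of dimension $D$, regardless of how many units each level contains, and its first entry is exactly what cBoW pooling would return on the level-1 vectors.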


3.3 Gating Network 

Suppose we are interested in a classification problem; the approach extends
easily to other problems of interest. Let $g(\cdot)$ be a discriminative
classifier that takes $\bar{h}^t \in \mathbb{R}^D$ as input and outputs the
probabilities for different classes. Let $w(\cdot)$ be a gating network that
takes $\bar{h}^t \in \mathbb{R}^D$, $t = 1, \ldots, T$ as input and outputs
a belief score $0 \le \gamma_t \le 1$. Intuitively, the belief score
$\gamma_t$ expresses how confident the model is that the $t$th level
summarization in the hierarchy is a suitable representation of the current
input instance for the task at hand. We require $\gamma_t \ge 0, \forall t$
and $\sum_{t=1}^{T} \gamma_t = 1$.

Let $C$ denote the categorical random variable corresponding to the class
label. The consensus of the whole system is reached by taking a mixture of
decisions made by levels of summarizations from the hierarchy:

$$p(C = c \mid x_{1:T}) = \sum_{t=1}^{T} p(C = c \mid \mathcal{H}_x = t) \cdot p(\mathcal{H}_x = t \mid x_{1:T}) = \sum_{t=1}^{T} g(\bar{h}^t) \cdot w(\bar{h}^t) \qquad (9)$$

where each $g(\cdot)$ is the classifier and $w(\cdot)$ corresponds to the
gating network in Fig. 1.
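
Eq. 9 is a mixture of per-level predictions weighted by belief scores. The sketch below takes the per-level class probabilities and the belief scores as given (how $g$ and $w$ produce them is outside this illustration); all numbers are made-up.

```python
def mix_predictions(level_probs, belief_scores):
    """Combine per-level class probabilities with belief scores (Eq. 9).

    level_probs[t][c] stands for p(C = c | level t);
    belief_scores[t] stands for gamma_t.
    """
    if abs(sum(belief_scores) - 1.0) > 1e-9 or min(belief_scores) < 0.0:
        raise ValueError("belief scores must be non-negative and sum to 1")
    num_classes = len(level_probs[0])
    return [sum(g * probs[c] for g, probs in zip(belief_scores, level_probs))
            for c in range(num_classes)]

# Made-up numbers: three levels, two classes.
level_probs = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
belief_scores = [0.5, 0.25, 0.25]

consensus = mix_predictions(level_probs, belief_scores)

# Putting all belief on the top level recovers a top-node-only model.
top_only = mix_predictions(level_probs, [0.0, 0.0, 1.0])
```

In this toy case the two higher levels favor class 0, but the first level carries the largest belief score, so the consensus still favors class 1.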


3.4 Back Propagation through Structure 

We use back propagation through structure (BPTS) [Goller and Kuchler, 1996]
to compute the partial derivatives of the objective function with respect to
the model parameters. Let $\mathcal{L}(\cdot)$ be our scalar objective
function. The goal is to derive the partial derivative of $\mathcal{L}$ with
respect to the model parameters in AdaSent, i.e., two recurrent matrices
$W_L$, $W_R$ and two local composition matrices $G_L$, $G_R$ (and their
corresponding bias vectors):

$$\frac{\partial \mathcal{L}}{\partial W_L} = \sum_{t=1}^{T} \sum_{j=1}^{T-t+1} \frac{\partial \mathcal{L}}{\partial h^t_j} \frac{\partial h^t_j}{\partial W_L}, \qquad \frac{\partial \mathcal{L}}{\partial W_R} = \sum_{t=1}^{T} \sum_{j=1}^{T-t+1} \frac{\partial \mathcal{L}}{\partial h^t_j} \frac{\partial h^t_j}{\partial W_R} \qquad (10)$$

The same analysis can be applied to compute
$\partial \mathcal{L} / \partial G_L$ and
$\partial \mathcal{L} / \partial G_R$. Taking into account the DAG structure
of AdaSent, we can compute $\partial \mathcal{L} / \partial h^t_j$
recursively in the following way:

$$\frac{\partial \mathcal{L}}{\partial h^t_j} = \frac{\partial \mathcal{L}}{\partial h^{t+1}_j} \frac{\partial h^{t+1}_j}{\partial h^t_j} + \frac{\partial \mathcal{L}}{\partial h^{t+1}_{j-1}} \frac{\partial h^{t+1}_{j-1}}{\partial h^t_j} \qquad (11)$$

Now consider the left and right local BP formulations:

$$\frac{\partial h^{t+1}_{j-1}}{\partial h^t_j} = \omega_r I + \omega_c \, \mathrm{diag}(f') W_R \qquad (12)$$

$$\frac{\partial h^{t+1}_j}{\partial h^t_j} = \omega_l I + \omega_c \, \mathrm{diag}(f') W_L \qquad (13)$$

where $I$ is the identity matrix and $\mathrm{diag}(f')$ is a diagonal
matrix spanned by the vector $f'$, which is the derivative of $f(\cdot)$
with respect to its input. The identity matrix in Eq. 12 and Eq. 13 plays
the same role as the linear unit recurrent connection in the memory block of
LSTM [Hochreiter and Schmidhuber, 1997]: it allows the constant error
carousel to effectively prevent the gradient vanishing problem that commonly
exists in recurrent neural nets and recursive neural nets. Also, the local
composition weights $\omega_l$, $\omega_r$ and $\omega_c$ in Eq. 12 and
Eq. 13 have the same effect as the forgetting gate in LSTM
[Gers et al., 2000] by allowing more flexible credit assignments during the
back propagation process.
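
A scalar caricature of Eq. 13 shows why the identity term helps. The sketch below follows a single path through the pyramid, treats the gate values, the activation slope and the weight as constants at every level, and multiplies the per-level factors. These are simplifying assumptions made for illustration only, and the numbers are made-up.

```python
def path_gradient_factor(depth, gate_pass, gate_compose, activation_slope, weight):
    """Product of per-level scalar Jacobians along one path of the pyramid.

    Each level contributes gate_pass + gate_compose * activation_slope * weight,
    the scalar analogue of omega_l * I + omega_c * diag(f') * W_L in Eq. 13.
    """
    per_level = gate_pass + gate_compose * activation_slope * weight
    return per_level ** depth

# A plain recursive net corresponds to gate_pass = 0 and gate_compose = 1.
plain = path_gradient_factor(10, 0.0, 1.0, 0.1, 1.0)

# With gating, part of the gradient bypasses the saturated non-linearity.
gated = path_gradient_factor(10, 0.5, 0.3, 0.1, 1.0)
```

With a saturated activation (slope 0.1), ten levels shrink the plain factor to about $10^{-10}$, while the gated path keeps a factor of $0.53^{10}$, several orders of magnitude larger.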


4 Experiments 

In this section, we study the empirical performance of AdaSent on 5
benchmark data sets for sentence and short phrase classification and compare
it to other competitor models. We also visualize the representation of the
input sequence learned by AdaSent by projecting it into a 2-dimensional
space using PCA to qualitatively study why AdaSent works for short sequence
modeling.






























4.1 Experimental Setting 


Statistics about the data sets used in this paper are listed in Table 1. We
describe each data set in detail below:

1. MR. Movie reviews [Pang and Lee, 2005]^1 data set where each instance is
a sentence. The objective is to classify each review by its overall
sentiment polarity, either positive or negative.

2. CR. Annotated customer reviews of 14 products obtained from Amazon
[Hu and Liu, 2004]^2. The task is to classify each customer review into
positive and negative categories.

3. SUBJ. Subjectivity data set where the goal is to classify each instance
(snippet) as being subjective or objective [Pang and Lee, 2004].

4. MPQA. Phrase level opinion polarity detection subtask of the MPQA data
set [Wiebe et al., 2005]^3.

5. TREC. Question data set, in which the goal is to classify an instance
(question) into 6 different types [Li and Roth, 2002]^4.


Data   N      dist(+,-)                        K   |w|   test
MR     10662  (0.5, 0.5)                       2   18    CV
CR     3788   (0.64, 0.36)                     2   17    CV
SUBJ   10000  (0.5, 0.5)                       2   21    CV
MPQA   10099  (0.31, 0.69)                     2   3     CV
TREC   5952   (0.1, 0.2, 0.2, 0.1, 0.2, 0.2)   6   10    500

Table 1: Statistics of the five data sets used in this paper. N counts the
number of instances and dist lists the class distribution in the data set.
K represents the number of target classes. |w| measures the average number
of words in each instance. test is the size of the test set. For data sets
which do not provide an explicit train/test split, we use 10-fold
cross-validation (CV) instead.


We compare AdaSent with the methods listed below on the five data sets.

1. NB-SVM and MNB. Naive Bayes SVM and Multinomial Naive Bayes with unigram
and bigram features [Wang and Manning, 2012].

2. RAE and MV-RecNN. Recursive autoencoder [Socher et al., 2011] and
matrix-vector recursive neural network [Socher et al., 2012]. In these two
models, words are gradually composed into phrases and sentences along a
binary parse tree.

3. CNN [Kim, 2014] and DCNN [Kalchbrenner et al., 2014]. Convolutional
neural networks for sentence modeling. In DCNN, the authors apply dynamic
k-max pooling over time to generalize the original max pooling in
traditional CNN.

4. P.V. Paragraph Vector [Le and Mikolov, 2014] is an unsupervised model to
learn distributed representations of words and paragraphs. We use the public
implementation of P.V.^5 and use logistic regression on top of the
pre-trained paragraph vectors for prediction.

5. cBoW. Continuous Bag-of-Words model. As discussed above, we use average
pooling or max pooling as the global pooling mechanism to compose a
phrase/sentence vector from a set of word vectors.

6. RNN, BRNN. Recurrent neural networks and bidirectional recurrent neural
networks [Schuster and Paliwal, 1997]. For bidirectional recurrent neural
networks, the reader is referred to [Lai et al., 2015] for more details.

7. GrConv. Gated recursive convolutional neural network [Cho et al., 2014]
shares the pyramid structure with AdaSent and uses the top node in the
pyramid as a fixed-length vector representation of the whole sentence.

^1 https://www.cs.cornell.edu/people/pabo/movie-review-data/
^2 http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
^3 http://mpqa.cs.pitt.edu/
^4 http://cogcomp.cs.illinois.edu/Data/QA/QC/


4.2 Training 

The difficulty of training recurrent neural networks is largely due to the
notorious gradient exploding and gradient vanishing problems
[Bengio et al., 1994; Pascanu et al., 2013]. As analyzed and discussed
before, the DAG structure combined with the local gating composition
mechanism of AdaSent naturally helps to avoid the gradient vanishing
problem. However, the gradient exploding problem still exists, as we observe
in our experiments. In this section, we discuss the implementation details
we use to mitigate the gradient exploding problem and give some practical
tricks that improve performance in the experiments.


Regularization of Recurrent Matrix 

The root of the gradient exploding problem in recurrent neural networks and
other related models lies in the large spectral norm of the recurrent
matrix, as shown in Eq. 12 and Eq. 13. Suppose the spectral norm of $W_L$
and $W_R$ is $\gg 1$; then the recursive application of Eq. 12 and Eq. 13 in
the back propagation process will cause the norm of the gradient vector to
explode. To alleviate this problem, we propose to penalize the Frobenius
norm of the recurrent matrix, which acts as a surrogate (upper bound) of the
corresponding spectral norm, since 1) it is computationally expensive to
compute the exact value of the spectral norm and 2) it is hard to establish
a direct connection between the spectral norm and the model parameters to
incorporate it into our objective function. Let $\mathcal{L}(\cdot, \cdot)$
be our objective function to minimize. For example, when $\mathcal{L}$ is
the negative log-likelihood in the classification setting, our optimization
can be formulated as

$$\text{minimize} \quad \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x_i, y_i) + \lambda \left( \|W_L\|_F + \|W_R\|_F \right) \qquad (14)$$

where $x_i$ is the training sequence and $y_i$ is the label. The value of
the regularization coefficient $\lambda$ is problem dependent. In our
experiments, typical values of $\lambda$ range from 0.01 to
$5 \times 10^{-[\ldots]}$. For all our experiments, we use minibatch AdaGrad
[Duchi et al., 2011] with the norm-clipping technique
[Pascanu et al., 2013] to optimize the objective function in Eq. 14.
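
The two ingredients above, a Frobenius-norm penalty on the recurrent matrices and gradient norm clipping, can be sketched as follows. This is an illustration under assumptions: the data term is taken as the average of per-example losses, the penalty uses the (unsquared) Frobenius norm as described in the text, the clipping threshold is a hypothetical hyper-parameter, and the AdaGrad update itself is not shown.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def regularized_objective(per_example_losses, W_L, W_R, lam):
    """Average data loss plus lam * (||W_L||_F + ||W_R||_F)."""
    data_term = sum(per_example_losses) / len(per_example_losses)
    return data_term + lam * (frobenius_norm(W_L) + frobenius_norm(W_R))

def clip_by_norm(gradient, threshold):
    """Rescale a gradient vector so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    return [g * threshold / norm for g in gradient]

# diag(3, 4) has spectral norm 4 and Frobenius norm 5, so the bound holds.
W_L = [[3.0, 0.0], [0.0, 4.0]]
W_R = [[0.0, 0.0], [0.0, 0.0]]

objective = regularized_objective([1.0, 3.0], W_L, W_R, lam=0.01)
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
```

Clipping leaves the direction of the gradient unchanged and only caps its length, so it bounds the step size without biasing which way the parameters move.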


^5 https://github.com/mesnilgr/iclr15






































































Implementation Details 

Throughout our experiments, we use a 50-dimensional word embedding trained
using word2vec [Mikolov et al.] on the Wikipedia corpus (~1B words). The
vocabulary size is about 300,000. For all the tasks, we fine-tune the word
embeddings during training to improve the performance
[Collobert et al., 2011]. We use the hyperbolic tangent function as the
activation function in the composition process, as rectified linear units
[Nair and Hinton, 2010] are more prone to the gradient exploding problem in
recurrent neural networks and their variants. We use an MLP to implement the
classifier on top of the hierarchy and use a softmax function to implement
the gating network. We also tried using an MLP to implement the gating
network, but this does not improve the performance significantly.


4.3 Experiment Results 


Model      MR     CR     SUBJ   MPQA   TREC
NB-SVM     79.4   81.8   93.2   86.3   -
MNB        79.0   80.0   93.6   86.3   -
RAE        77.7   -      -      86.4   -
MV-RecNN   79.0   -      -      -      -
CNN        81.5   85.0   93.4   89.6   93.6
DCNN       -      -      -      -      93.0
P.V.       74.8   78.1   90.5   74.2   91.8
cBoW       77.2   79.9   91.3   86.4   87.3
RNN        77.2   82.3   93.7   90.1   90.2
BRNN       82.3   82.6   94.2   90.3   91.0
GrConv     76.3   81.3   89.5   84.5   88.4
AdaSent    83.1   86.3   95.5   93.3   92.4

Table 2: Classification accuracy of AdaSent compared with other models. For
NB-SVM, MNB, RAE, MV-RecNN, CNN and DCNN, we use the results reported in the
corresponding papers. We use the public implementation of P.V. and we
implement the other methods ourselves.


The classification accuracy of AdaSent compared with other models is shown
in Table 2. AdaSent consistently outperforms P.V., cBoW, RNN, BRNN and
GrConv by a large margin while achieving results comparable to the
state-of-the-art and using far fewer parameters: the number of parameters in
our models ranges from 10K to 100K, while in CNN the number of parameters is
about 400K.^6 AdaSent outperforms all the other models on the MPQA data set,
which consists of short phrases (the average length of each instance in MPQA
is 3). We attribute the success of AdaSent on MPQA to its power in modeling
short phrases, since long range dependencies are hard to detect and
represent.

Compared with BRNN, the level-wise global pooling in AdaSent helps to
explicitly model phrases of different lengths, while in BRNN the
summarization process is more sensitive to a small range of nearby words.
Hence, AdaSent consistently outperforms BRNN on all data sets. Also, AdaSent
significantly outperforms GrConv on all the data sets, which indicates that
the variable length multi-scale representation is key to its success. As a
comparison, GrConv does not perform well because it fails to keep the
intermediate representations. More results on using GrConv as a fixed-length
sequence encoder for machine translation and related tasks can be found in
[Cho et al., 2014]. cBoW is quite effective on some tasks (e.g., SUBJ). We
think this is due to the language regularities encoded in the word vectors
and also the characteristics of the data itself. It is surprising that P.V.
performs worse than other methods on the MPQA data set. This may be due to
the fact that the average length of instances in MPQA is small, which limits
the number of context windows when training P.V.

^6 The state-of-the-art accuracy on TREC is 95.0, achieved by
[Silva et al., 2011] using SVM with 60 hand-coded features.


Model     MR             CR             SUBJ           MPQA           TREC
P.V.      71.11 ± 0.80   71.22 ± 1.04   90.22 ± 0.21   67.93 ± 0.57   86.30 ± 1.10
cBoW      72.74 ± 1.03   71.86 ± 2.00   90.58 ± 0.52   84.04 ± 1.20   85.16 ± 1.76
RNN       74.39 ± 1.70   73.81 ± 3.52   89.97 ± 2.88   84.52 ± 1.17   84.24 ± 2.61
BRNN      75.25 ± 1.33   76.72 ± 2.78   90.93 ± 1.00   85.36 ± 1.13   86.28 ± 0.90
GrConv    71.64 ± 2.09   71.52 ± 4.18   86.53 ± 1.33   82.00 ± 0.88   82.04 ± 2.23
AdaSent   79.84 ± 1.26   83.61 ± 1.60   92.19 ± 1.19   90.42 ± 0.71   91.10 ± 1.04

Table 3: Model variance.


We also report the model variance of P.V., cBoW, RNN, BRNN, GrConv and
AdaSent in Table 3, obtained by running each of the models on every data set
10 times using different settings of hyper-parameters and random
initializations. We report the mean classification accuracy and the standard
deviation of the 10 runs on each data set. Again, AdaSent consistently
outperforms all the other competitor models on all the data sets.

To study how the multi-scale hierarchy is combined by AdaSent in the final
consensus, for each data set, we sample two sentences with a pre-specified
length and compute their corresponding belief scores. We visualize the
belief scores of the 10 sentences as a matrix, shown in Fig. 5. As
illustrated in Fig. 5, the distribution of belief scores varies among
different input sentences and also different data sets. The gating network
is trained to adaptively select the most appropriate representation in the
hierarchy by giving it the largest belief score. We also give a concrete
example from MR to show both the predictions computed from each level and
their corresponding belief scores given by the gating network in Fig. 6. The
first row in Fig. 6 shows the belief scores
$\Pr(\mathcal{H}_x = t \mid x_{1:T}), \forall t$ and the second row shows
the probability $\Pr(y = 1 \mid \mathcal{H}_x = t), \forall t$ predicted
from each level in the hierarchy. In this example, although the classifier
predicts incorrectly for higher level representations, the gating network
assigns the first level the largest belief score, leading to a correct final
consensus. The flexibility of the multi-scale representation combined with a
gating network allows AdaSent to generalize GrConv in the sense that GrConv
corresponds to the case where the belief score at the root node is 1.0.

Figure 5: Each row corresponds to the belief score of a sentence of length
12 sampled from one of the data sets. From top to bottom, the 10 sentences
are sampled from MR, CR, SUBJ, MPQA and TREC respectively.


Figure 6: Sentence: "If the movie were all comedy it might work better but
it has an ambition to say something about its subjects but not willingness."
True label = 0, Pr(y = 1 | x) = 0.318.



Figure 7: Different colors and patterns correspond to different objective
classes. The first, second and third rows correspond to SUBJ, MPQA and TREC
respectively and the left and right columns correspond to AdaSent and cBoW
respectively.

To show that AdaSent is able to automatically learn the appropriate
representation for the task at hand, we visualize the first two principal
components (obtained by PCA) of the vector with the largest weight in the
hierarchy for each sentence in the data set. Fig. 7 shows the projected
features from AdaSent (left column) and cBoW (right column) for SUBJ (1st
row), MPQA (2nd row) and TREC (3rd row). During training, the model
implicitly learns a data representation that enables better prediction. This
property of AdaSent is notable because we do not explicitly add any
separation constraint to our objective function to achieve it.

5 Conclusion 

In this paper, we propose AdaSent as a new hierarchical sequence modeling
approach. AdaSent explores a new direction: representing a sequence by a
multi-scale hierarchy instead of a flat, fixed-length, continuous vector
representation. The analysis and the empirical results demonstrate the
effectiveness and robustness of AdaSent in short sequence modeling.
Qualitative results show that AdaSent can learn to represent input sequences
depending on the task at hand.